DOCUMENT RESUME 

ED 290 795 TM 870 586 



AUTHOR          Littlefield, John; Troendle, G. Roger
TITLE           Effects of Rating Task Instructions on Consistency
                and Accuracy of Expert Raters.
PUB DATE        Apr 87
NOTE            9p.; Paper presented at the Annual Meeting of the
                American Educational Research Association
                (Washington, DC, April 20-24, 1987).
PUB TYPE        Reports - Research/Technical (143) --
                Speeches/Conference Papers (150)
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     Dentistry; Evaluation Methods; Higher Education;
                *Holistic Evaluation; *Interrater Reliability;
                *Scoring; *Test Reliability
IDENTIFIERS     *Rater Reliability



ABSTRACT 

The effect of different types of rating task 
instructions on rater behavior was examined using experts, as opposed 
to novices, as raters. The experts were instructed to (1) form a 
global categorical judgment (early hypothesis generation); (2) assess 
19 detailed elements; or (3) both. Subjects were 8 dental faculty 
members who ranged in age from 28 to 60 and who had at least 2 years 
of teaching experience. The task was to evaluate five three-fourth 
dental crown preparations twice, using each of the three types of 
rating instructions. Intrarater and interrater agreement and 
reliability were assessed, as was the level of rater accuracy. Higher 
coefficients of rater reliability, but not agreement, resulted from 
the global and combined instructions, compared with the 19-point, or 
traditional, instruction. The global judgment alone improved 
reliability over traditional instructions, but intrarater agreement 
was lower. Expert rater consistency was higher when early hypothesis 
generation and self-monitoring were encouraged by the rating 
instructions. There were no significant differences in score accuracy 
among the three types of instruction. (MGD) 
Effects of Rating Task Instructions on Consistency and Accuracy of Expert Raters 



Background 

Following a comprehensive review of the performance rating research, Landy and Farr (1980) 
recommended a moratorium on rating form research. They noted that the rater's information 
processing serves as a cognitive filter of the measurement data and that we need to better understand 
the cognitive processes raters use in making judgments. Almost synchronously, major theoretical 
articles appeared addressing issues such as cognitive models to account for halo error (Cooper, 
1981) and automatic vs. consciously-controlled rater judgment (Feldman, 1981). Sophisticated 
cognitive research is now being conducted to improve our understanding of the influence of relevant 
rater knowledge on halo error (Kozlowski et al., 1986) and rater cognitive simplification strategies 
(Cadwell and Jenkins, 1986). Cadwell and Jenkins describe the "rater as the measuring 
instrument," reflecting a profound shift in the focus of performance ratings research. This study 
will use cognitive processing of experts as a conceptual framework to predict changes in rater 
behavior in response to different types of rating task instructions. 

Research in cognitive psychology has expanded our understanding of cognitive processing by 
experts. Unlike novices, experts form highly elaborated cognitive representations of a problem 
(Larkin et al., 1980). Experts organize knowledge structures over long periods of learning and 
experience (Glaser, 1984). When faced with a problem, experts automatically (i.e., without 
conscious effort) construct an initial high quality representation of the problem. Their knowledge is 
"chunked" around principles and abstractions which subsume surface features of the problem, and 
their perceptions are influenced by pattern recognition processes (Brandsford et al., 1986). In 
medical diagnosis, for example, expert physicians generate hypotheses early on, and the correct 
diagnosis is very likely among those hypotheses (Norman, 1985). In the public schools, teachers 
"size up" students as individuals, grouping them very quickly, and these initial estimates remain 
quite stable (Stiggins et al., 1986). Cognitive psychology research is providing an insight into the 
power of human thinking to use a large knowledge base in an efficient and automatic manner. 

Most performance ratings research has used nonexpert raters as subjects, typically college 
students. In real life training environments, however, raters are frequently experts at the tasks 
being rated (e.g., physicians, teachers). When judging a performance, an expert rater is likely to 
quickly construct an initial representation of the performance. That representation will include 
knowledge about the appropriateness of various performance elements to solving the problem at hand. 
For example, in conducting a clinical exam, a medical student may follow the correct procedure but 
overlook a disease finding, thereby seriously compromising the validity of the entire exam. An 
expert rater would judge the performance as a failure while a novice rater would simply note a step 
that was overlooked. 

Recognizing that raters are experts suggests a very different role for the performance rating 
form. Instead of being an "instrument" which defines how observations are to be made, the rating 
form could be designed to facilitate communicating and quantifying observations by the expert. The 
instructions on the proposed rating form would request a global categorical judgment (early 
hypothesis generation) plus assessment of various detailed elements of the performance. The detailed 
assessment would serve as a stimulus for rater self-monitoring to verify the initial "early 
hypothesis" and also provide documentation for the rationale used in making the global judgment. 

The general appearance of the rating form proposal above is not a radical departure from 
traditional formats; its role in relation to rater cognitive processes is significantly changed, 
however. Traditional instructions to the rater are to observe the performance and mark numerous 
detailed criteria in a mechanical fashion. Marking the detailed criteria parallels a novice's approach 
to problem solving by collecting numerous miscellaneous facts (Larkin et al., 1980). Traditional 
rating instructions include calculating a score by summing the marks. This scoring approach reflects 
psychometric test theory, which attributes considerable specificity and measurement error to 
individual items; the summed score therefore will be a more generalizable measure of overall 
performance (Nunally, 1978). The proposed instructions emphasize the expert's global judgment, 
with the assessment of detailed performance elements serving an ancillary role. 

If the proposed rating task instructions facilitate quantifying expert observations, the resultant 
ratings should display improved consistency and accuracy when compared to traditional rating 
instructions. Rater consistency has two components: (1) agreement - the extent to which different 
judges make exactly the same judgment; and (2) reliability - the degree to which ratings by different 
judges are proportional when expressed as deviations from their respective mean scores (Tinsley and 
Weiss, 1975). Hogan et al. (1986) have recently demonstrated the importance of calculating both 
measures of rater consistency in an applied setting. Lord (1983) recommends that rater accuracy is 
best assessed by calculating Receiver Operating Curves (ROC). ROC analysis graphically displays 
the trade-off between the probability of making true positive vs. false positive decisions. In applied 
decision-making settings, one must always judge between the relative cost of making false positive 
vs. false negative errors. 
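
The distinction between agreement and reliability can be illustrated with a minimal sketch. The grades below are hypothetical and are not data from this study; the function names are illustrative assumptions, and a Pearson correlation is used only as a simple stand-in for "proportionality of deviations from each judge's mean score," not as one of the coefficients used in the analyses.

```python
from statistics import mean


def exact_agreement(a, b):
    """Proportion of cases on which two judges give exactly the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Correlation between two judges' deviations from their own mean scores."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (ss_a * ss_b) ** 0.5


# Hypothetical grades on the 0-4 scale (assumed for illustration only).
# Judge B is uniformly one grade harsher than judge A.
judge_a = [1, 2, 3, 4, 3]
judge_b = [0, 1, 2, 3, 2]

print("agreement:", exact_agreement(judge_a, judge_b))
print("reliability (Pearson r):", round(pearson_r(judge_a, judge_b), 3))
```

In this constructed case the judges never assign the same grade, yet they rank and space the cases identically, so agreement is at its minimum while the correlation is at its maximum. This is why the two indices are reported separately.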



Methods 

Subjects were eight dental faculty members who ranged in age from 28 to 60 years. All subjects 
had two or more years of clinical teaching experience and had participated in developing the detailed 
rating criteria used in this study. 

The rating task in this experiment is a routine part of the subjects' daily job responsibilities. 
The stated purpose of the experimental task was to standardize grading methods. The task was to 
evaluate five 3/4 crown preparations twice using each of three different types of rating instructions. 

1. Traditional - mark each of 19 criteria on a three-category scale (Acceptable, needs 
Improvement or Unsatisfactory). "Be sure to mark each criterion either A, I, or U." A single 
composite score was calculated ex post facto by the investigators. 

2. Global Judgment Only - "After inspecting the tooth, 
write your grade (4,3,2,1,0). As in clinic, 4 is the best grade and 0 is a failure." 

3. Combined - "After marking the criteria, assign a grade according to the grade code provided." 

The Combined instructions condition is an attempt to conform the rating task instructions to the 
cognitive processes of experts. Presumably, the rater would initially form an early hypothesis about 
the tooth being judged, then review the detailed criteria to confirm the hypothesis. The Global 
Judgment Only condition provides a middle ground between Traditional rater instructions and 
Combined instructions. Global Judgment Only should improve rater consistency and accuracy 
over the Traditional instructions, but the addition of detailed criteria in the Combined condition 
should improve consistency and accuracy even further by serving as a self-monitoring check for the 
expert raters to verify their initial hypotheses. 

Data collection procedures were described in detail by Troendle (1983). Raters were assigned 
code numbers to maintain anonymity and were not informed that they were reevaluating the same 
five teeth. Teeth were identified only by code numbers, and at least six weeks intervened between 
each trial session. 

Data analysis procedures addressed two general questions: 

1. Do the three types of rating instructions result in different levels of intrarater and 
interrater consistency? 

2. Do the three types of rating instructions result in different levels of rater accuracy? 

As noted earlier, rater consistency has two components: agreement and reliability. Intrarater and 
interrater agreement were assessed using a tau coefficient suggested by Tinsley and Weiss (1975). 
Tau consists of a Chi Square test to ensure that observed agreement exceeds chance levels, followed 
by calculation of a percent agreement coefficient adjusted down for chance agreements. Statistical 
significance of differences in agreement levels among the three types of rating instructions was 
tested by calculating Cochran's Q (SPSS, 1984) using actual (unadjusted) frequency of agreement 
data. Intrarater reliability was assessed using Kendall's tau-b correlation coefficient while 
interrater reliability was assessed using an intraclass correlation coefficient recommended by Finn 
(1970). Rater accuracy was assessed by calculating three Receiver Operating Curves (ROC) based 
upon Signal Detection Theory (Swets and Pickett, 1982). Data from the pairs of trials (1-2, 3-4, 
& 5-6) were pooled to calculate the ROC for each type of rating instructions. Statistical 
significance of differences in accuracy among the three types of rating instructions was assessed by 
calculating critical ratios among the various pairs of the three curves (Metz et al., 1984). 
Interrater consistency is the upper limit of rater accuracy in the same sense that test reliability is 
the upper bound of test validity. 
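
The idea of a percent agreement coefficient "adjusted down for chance agreements" can be sketched as follows. This is a generic chance-corrected proportion, (observed agreements - expected chance agreements) / (total - expected chance agreements); whether it matches the exact computation used in this study is an assumption, as is the chance probability of 1/5 for exact agreement on a five-category grade scale. The counts are hypothetical.

```python
def chance_adjusted_agreement(n_agree, n_total, n_categories):
    """Proportion of agreement corrected for agreement expected by chance.

    Assumes every category is equally likely under chance, so the chance
    probability of two judgments matching exactly is 1 / n_categories.
    """
    expected_by_chance = n_total / n_categories
    return (n_agree - expected_by_chance) / (n_total - expected_by_chance)


# Hypothetical counts (not from the study): 24 matching pairs out of 40,
# graded on a five-category scale (4, 3, 2, 1, 0).
raw = 24 / 40
adjusted = chance_adjusted_agreement(n_agree=24, n_total=40, n_categories=5)
print("unadjusted agreement:", raw)
print("chance-adjusted agreement:", adjusted)
```

The adjusted value is always lower than the raw proportion whenever some agreement is expected by chance, and it equals zero when observed agreement is no better than chance.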



Figure 1 displays the design of the study, descriptive statistics and various intrarater and 
interrater consistency coefficients. 

Figure 1 - Study Design, Descriptive Statistics and Consistency Coefficients 
1. Adjusted down for agreements due to chance 



Intrarater agreement was defined as rater i making the same judgment about tooth j both times 
the tooth was judged using a given set of rater instructions. Forty pairs of judgments were made 
under each condition. Levels of intrarater agreement among the three types of instructions did not 
differ significantly when the unadjusted frequencies of agreement were tested by Cochran's Q 
(Q=3.84, p=.15). Interrater agreement was defined as all 8 raters marking within ± 1 category of the 
modal judgment for that tooth. A total of 10 teeth were judged under each condition. Levels of 
interrater agreement among the three types of instructions did not differ significantly when tested by 
Cochran's Q (Q=1.14, p=.57). 
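
As a check on the reported probabilities, the sketch below computes Cochran's Q from a table of 0/1 agreement indicators and converts Q to a p-value. With three conditions, Q is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(-Q/2). The small table is hypothetical (the study's raw agreement data are not given in the text), and the function names are illustrative assumptions.

```python
import math


def cochran_q(table):
    """Cochran's Q for a table of 0/1 outcomes.

    Rows are matched cases (e.g., rater-tooth pairs); columns are the
    conditions being compared. Undefined when every row is all 0s or all 1s.
    """
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator


def p_value_df2(q):
    """Upper-tail chi-square probability for 2 degrees of freedom."""
    return math.exp(-q / 2.0)


# Hypothetical agreement indicators for six cases under three conditions.
example_table = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
print("Q for hypothetical table:", cochran_q(example_table))

# The two Q values reported in the text.
for q in (3.84, 1.14):
    print(f"Q = {q}: p = {p_value_df2(q):.2f}")
```

Applied to the two Q values reported above, the conversion gives probabilities that round to .15 and .57, consistent with the text.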

The intrarater reliability coefficients for the three types of rating instructions are substantially 
different. The upper limit of the 95% confidence interval for the Traditional instructions 
coefficient is .65, which does not include the other two coefficients. The tau-b coefficients were each 
based upon 40 matched pairs of judgments. The interrater reliability coefficients also are 
substantially different. The upper limit of the 95% confidence interval for the Traditional 
instructions coefficient is .52, which does not include the Global Judgment Only instructions 
coefficient. The intraclass correlation coefficients were calculated using the residual mean square 
from a one way ANOVA for each of the three groups of 80 scores (5 teeth x 8 raters x 2 trials). 

Figure 2 presents Receiver Operating Curves (ROC) for each of the three types of rating 
instructions. Each curve is based upon 80 scores. To read the curves, note that at .1 probability of a 
false positive decision, the Traditional instructions result in a .78 probability of a true positive 
decision while the Global Only and the Combined instructions result in a .66 probability. One 
measure of accuracy using ROC analysis is the proportion of the unit square area covered by the curve 
(100% is perfect accuracy). The Traditional instructions curve covers 93% of the area while 
the Global Only curve covers 89% and the Combined curve covers 87%. A correlated 
observations critical ratio test of the various pairs of curves (Metz et al., 1984) did not reject the 
null hypothesis that the various pairwise sets of rating data were samples drawn from the same 
underlying ROC curve. A visual inspection of the curves confirms the statistical finding. 
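
The "proportion of the unit square covered by the curve" can be approximated from a set of operating points with the trapezoid rule, as in the minimal sketch below. This is a simpler substitute technique, not the curve-fitting procedure used for Figure 2, so it would not be expected to reproduce the areas reported above. The operating points are hypothetical and the function name is an illustrative assumption.

```python
def roc_area(points):
    """Trapezoid-rule area under an ROC curve.

    `points` is a list of (false_positive_rate, true_positive_rate) pairs.
    The end points (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical operating points (not read from Figure 2).
hypothetical_points = [(0.1, 0.6), (0.3, 0.8)]
print("area:", round(roc_area(hypothetical_points), 3))

# With no operating points the curve is the chance diagonal.
print("chance diagonal:", roc_area([]))
```

An area of .5 corresponds to the chance diagonal and 1.0 to perfect accuracy, which is the scale on which the three curves are compared.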

Figure 2 - Receiver Operating Curves 

The Combined and Global Judgment Only instructions resulted in higher coefficients of 
rater reliability, but no difference in rater agreement when compared to the Traditional 
instructions. The higher reliability coefficients indicate greater discrimination power (i.e., more 
confidence in rank ordering the five teeth from best to worst). Mathematically, the differences are 
due to greater variance among raters judging the same tooth using Traditional instructions (i.e., 
the within tooth mean square is larger). The differences between intrarater reliability of 
Combined and Traditional instructions also exist when scoring is done by summing judgments of 
the detailed criteria (Littlefield and Troendle, 1986); thus the differences in rater reliability 
apparently are not an artifact of the categorical scoring method. Taken together, the higher rater 
reliability coefficients indicate that Combined and Global Judgment Only instructions cause 
expert raters to produce scores which are more numerically precise than Traditional 
instructions. Perhaps being instructed to "mark each criterion" with no reference to an overall 
judgment (i.e., the Traditional instructions) disrupts the early hypothesis generation process 
which experts typically use in making judgments. Marking detailed criteria in the Combined 
instructions improved the intrarater reliability coefficient in comparison to Global Judgment 
Only (.81 vs. .68), but resulted in slightly lower interrater reliability (.47 vs. .54). 
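
The link between the within tooth mean square and the interrater reliability coefficient can be made concrete with a sketch of Finn's (1970) coefficient as it is commonly described: r = 1 - (within mean square) / (variance expected if ratings were assigned at random). For a scale with k equally likely categories the chance variance is (k^2 - 1) / 12. The ratings below are hypothetical, the five-category scale is an assumption about which scores entered the calculation, and the function names are illustrative.

```python
def within_mean_square(groups):
    """Within-group (residual) mean square from a one way ANOVA.

    `groups` is a list of lists; each inner list holds all ratings
    given to one tooth.
    """
    ss_within = 0.0
    n_total = 0
    for ratings in groups:
        group_mean = sum(ratings) / len(ratings)
        ss_within += sum((x - group_mean) ** 2 for x in ratings)
        n_total += len(ratings)
    return ss_within / (n_total - len(groups))


def finn_r(groups, n_categories):
    """Finn's r: 1 minus observed within variance over chance variance."""
    chance_variance = (n_categories ** 2 - 1) / 12.0
    return 1.0 - within_mean_square(groups) / chance_variance


# Hypothetical ratings of two teeth by four raters on a 0-4 scale.
tight_ratings = [[4, 4, 3, 3], [1, 1, 0, 2]]   # raters cluster closely
loose_ratings = [[4, 2, 3, 1], [2, 0, 1, 3]]   # raters spread widely

print("tight:", round(finn_r(tight_ratings, 5), 3))
print("loose:", round(finn_r(loose_ratings, 5), 3))
```

Because the chance variance is fixed by the scale, any increase in the spread of ratings for the same tooth lowers the coefficient directly, which is the mechanism described above for the Traditional instructions.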

Levels of rater agreement were not significantly different among the three types of rating 
instructions. The intrarater agreement levels under Combined instructions were just short of 
statistical significance when tested against the Global Judgment Only instructions 
(.63 vs. .41, p<.06), again reflecting a possible benefit due to rater self-monitoring when marking 
the detailed criteria. The lack of statistically significant differences among interrater agreement 
levels may be due to low statistical power (n=10 in each group). In general, the consistency 
coefficients in Figure 1 indicate that the Combined and Global Judgment Only instructions 
have moderate-to-high reliability, but moderate-to-low agreement. Hogan et al. (1986) also found 
differences between reliability and agreement in ratings of nursing home patient disability and 
recommended calculating both indices. Future performance ratings research should assess both of 
these aspects of rater consistency. 

There were no statistically significant differences in rater accuracy; however, the statistical power 
of the tests was low. A power analysis of the data (Metz et al., 1984) indicated only a .23 
probability of detecting a significant difference between the Traditional and Combined instructions 
curves. In order to achieve a .75 probability of detecting a statistically significant difference, 
approximately 300 judgments would be needed as compared to 80 in this study. Three of the teeth 
received a majority judgment of "clinically acceptable" (72-94% agreement) while two were 
judged to be "clinically unacceptable" (73 & 86% agreement). Future studies of rater accuracy 
should use a larger number of items to be judged with a wider diversity in judgment difficulty in 
order to increase the statistical power. 

Ratings from the Combined and Global Judgment Only conditions correlate .74 with each 
other but only .59 and .56 respectively with Traditional ratings. The Traditional instructions 
result in more stringent decisions than the Combined and Global Judgment Only instructions 
(50% failure rate vs. 29% and 19%). Taken together, these correlations and differences in 
stringency of ratings from each condition support the internal validity of the study, namely that 
rather modest changes in rating task instructions affect the judgments of expert raters. 

The conclusions from this study are marred by at least two weaknesses. First, the raters knew 
the data were for research purposes. Landy and Farr (1980) noted that ratings for administrative 
purposes will be more lenient than those for research purposes. Raters in this study may have 
performed differently if the scores were to be used to determine student grades. A second weakness is 
the failure to use a randomized block design. One could argue that the raters learned the teeth in 
rating them six times. The teeth were numbered with different-colored ink and tape for each trial 
and stored loosely in a box. Post hoc conversations with the raters did not indicate that they 
recognized the teeth. With a small cohesive group of subjects, attempting to have different groups 
simultaneously using different rating instructions was deemed unfeasible. 

Future research in this area should use a larger number of stimuli with more diverse levels of 
judgment difficulty in order to improve the power of the tests of differences among the resulting 
Receiver Operating Curves. It might be advantageous to make the overall judgment in the Combined 
rating instructions independent of what is marked on the detailed criteria. One can never develop 
rating criteria which anticipate all possible outcomes; therefore, the printed criteria should be 
viewed as a sample of all possible criteria which could relate to the overall judgment. Construct 
psychology (Button, 1985) offers a possible technology for identifying the general constructs used 
by experts to make judgments in their field of expertise. With a better understanding of cognitive 
processing by expert raters, rating forms and their related instructions could be designed to more 
effectively facilitate quantifying and communicating judgments. 

The results of this study address general classification decisions but do not address the problem of 
rater agreement in marking detailed criteria. Interrater agreement in marking the detailed criteria 
was 35% for the Traditional rating instructions, not significantly higher than agreement due to 
chance. For the Combined instructions, the agreement level was 36%. Neither level is very 
encouraging when viewed from the perspective of providing formative feedback to students to improve 
future performance. 



Conclusions 



This study suggests that giving expert raters instructions that request a global categorical 
judgment supplemented by marking detailed criteria results in higher intrarater and interrater 
reliability than instructions that focus on marking each detailed criterion without reference to the 
overall judgment (i.e., traditional instructions). Giving rating instructions that request only a 
global judgment improves reliability over traditional instructions, but intrarater agreement is 
somewhat lower than when both a global judgment and detailed criteria assessment are requested. The 
results are interpreted from a conceptual framework of early hypothesis generation and 
self-monitoring by experts. The pattern of consistency coefficients supports a theoretical prediction 
of higher expert rater consistency when early hypothesis generation and self-monitoring are 
encouraged by the rating instructions (i.e., rating instructions which paralleled expert cognitive 
processing resulted in better reliability among expert raters). There were no significant 
differences in score accuracy among the three types of rating instructions. 